First, we will load all packages that we will use for our analyses.
library(ggplot2) # Data visualization
library(plotly) # Interactive data visualizations
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(psych) # Will be used for correlation visualizations
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(rattle) # Graphing decision trees
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(caret) # Machine learning
## Loading required package: lattice
library(klaR) # fOr nb
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
##
## select
Load the iris data set. This data set is part of the base data sets built-in in R, hence, we do not need to load it externally.
data("iris")
Lets check the first 6 rows as well as the summary statistics of our data to get a feel of how the data looks.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Lets do some exploratory data analysis to visually investigate and find patterns and insight into our data. This step can help us understand the data better and prepare us for creating our model later.
Lets create a correlation matrix using the pairs.panels() function from the PSYCH package to see how our variables correlate.
pairs.panels(
iris[,1:4], # Our data.
scale = TRUE, # Changes size of correlation value lables based on strength.
hist.col = 'grey85', # Histogram color.
bg = c("mediumseagreen","orange2","mediumpurple1")[iris$Species], # Colors of the Species levels.
pch = 21, # The plot characters shape and size.
main = 'Correlation matrix of Iris data') # Title.
The upper part of the correlation matrix tells us the correlation between the variables and we can see that there is a moderate to strong correlation between all variables except for between sepal length and sepal width. Looking at the bottom half of the matrix gives us additional insight into not only the scatter plots of these correlations, but also divides the data points by iris species using colors. This allows us to see the clusters that are present among the species.
Now we will create an interactive 3D scatter plot using plot_ly() from the PLOTLY package.
As our 3D plot is, well, 3 dimensional, we can only feed it 3 variables for the axes. As we have 4 variables I have decided to disclude sepal width from our plot as it seems to be the least important variable when looking at the individual variables (which we can see in tabs 3-6).
plot_ly(data = iris, # Data
x = ~Sepal.Length, y = ~Petal.Length, z = ~Petal.Width, # X, Y, and Z variables. Put a tilde sign (~) before each variable name.
color = ~Species, # Color separation by Species.
type = "scatter3d", # Use a 3D scatterplot.
mode = "markers" # Use markers.
) %>% # Pipe
layout(
scene = list(xaxis = list(title = 'Sepal length'), # Assign axes names.
yaxis = list(title = 'Petal length'),
zaxis = list(title = 'Petal width')))
This plot is very effective in showing us where our species place on the 3 variables, making it easier to gauge how close/apart they are clustered. We can, for example, see that the setosa data is in a quite isolated cluster, while versicolor and virginica, although also in quite clear clusters, show a slight overlap. These distinguishing clusters are a good sign for when we want to do our machine learning later as they will hopefully help our model in making good predictions.
Box plot of sepal width:
ggplot(
# (1) Set data; (2) Specify X and Y variables; (3) 'fill' color separates our Species levels.
data = iris, mapping = aes(x = Species, y = Sepal.Width, fill = Species)) +
geom_boxplot() + # Specifies that we want a box plot.
scale_fill_brewer(palette = 'Dark2') + # Change color of box plots.
theme_light() + # Set light theme.
labs(title = 'Box plot of sepal width for each species',
x = 'Species', y = 'Sepal width') # Assign a title, axis names.
From this box plot it can be seen that the setosa species has a higher sepal width median and interquartile range compared to the other two species. In contrast, the Versicolor and Virginica show quite a bit of overlap with each other in term of their interquartile range. This will make it harder for a machine learning algorithm to distinguish between the two species levels when predicting using this variable.
(For a detailed explanation of how to create the box plot check the third tab.)
Box plot of sepal length:
ggplot(data = iris, mapping = aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_boxplot() +
scale_fill_brewer(palette = 'Dark2') +
theme_light() +
labs(title = 'Box plot of sepal length for each species',
x = 'Species', y = 'Sepal length')
The ranges of the three species seem to somewhat overlap on the sepal length variable. However, their medians seem like they differ (at least visually. We don’t know if they’re statistically significantly different–we could test for this, but it is not necessary for the purposes of this kernel).
(For a detailed explanation of how to create the box plot check the third tab.)
Box plot of petal width:
ggplot(data = iris, mapping = aes(x = Species, y = Petal.Width, fill = Species)) +
geom_boxplot() +
scale_fill_brewer(palette = 'Dark2') +
theme_light() +
labs(title = 'Box plot of petal width for each species',
x = 'Species', y = 'Petal width')
This box plot seems to indicate quite a difference in petal width between the species.
Box plot of petal length:
ggplot(data = iris, mapping = aes(x = Species, y = Petal.Length, fill = Species)) +
geom_boxplot() +
scale_fill_brewer(palette = 'Dark2') +
theme_light() +
labs(title = 'Box plot of petal length for each species',
x = 'Species', y = 'Petal length')
This plot seems to indicate that the three species vary in terms of interquartile range on petal length. The setosa seems to have a very narrow interquartile range and have quite a lot shorter petal length compared to the other two species.
Before we begin to train our machine learning model we need to divide our data set into train and test subset. The train set is used for teaching or “training” our machine learning model how to make predictions with our data, while the test data set is used to test our model on data it has not seen in order to see how it performs.
First, lets set a random seed. This will ensure that we will be able to replicate our results when we rerun our analysis.
set.seed(222)
Next, we will create train/test partition with createDataPartition() from the CARET package. This will split our data through random sampling within our species levels and give us the sampled index.
train_index <- createDataPartition(y = iris$Species, # y = our dependent variable.
p = .7, # Specifies split into 70% & 30%.
list = FALSE, # Sets results to matrix form.
times = 1) # Sets number of partitions to create to 1.
Now we can split our data into train and test data using the randomly sampled train_index that we just formed.
train_data <- iris[train_index,] # Use train_index of iris data to create train_data.
test_data <- iris[-train_index,] # Use whatever that is not in train_index to create test_data.
Figure 1: Visualization of decision tree model.
Image source: Link
Model the decision tree model with a 10 fold cross validation.
fitControl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
Create a predictor model with the train() function from the CARET package. Specify method = 'rpart' to run a decision tree model.
# Create model
dt_model <- train(Species ~ ., # Set Y variable followed by '~'. The period indicates to include all variables for prediction.
data = train_data, # Data
method = 'rpart', # Specify SVM model
trControl = fitControl) # Use cross validation
Check the predicted accuracy of our decision tree model by running it on resamples of our train data. Later we will test the accuracy of the model by running a prediction on our test data.
confusionMatrix(dt_model)
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction setosa versicolor virginica
## setosa 33.3 0.0 0.0
## versicolor 0.0 29.5 5.7
## virginica 0.0 3.8 27.6
##
## Accuracy (average) : 0.9048
The results here tell us that our average accuracy is 90.48% when testing our data on resamples of our training data. We can also see what was predicted correctly/incorrectly.
Check the importance of each feature in our model.
# Create object of importance of our variables
dt_importance <- varImp(dt_model)
# Create plot of importance of variables
ggplot(data = dt_importance, mapping = aes(x = dt_importance[,1])) + # Data & mapping
geom_boxplot() + # Create box plot
labs(title = "Variable importance: Decision tree model") + # Title
theme_light() # Theme
This table gives us a very informative overview of the importance of each variable in predicting the species.
Now lets plot the decision tree using fancyRpartPlot() from the RATTLE package. This will give us clear insight into how the model makes its predictions.
fancyRpartPlot(dt_model$finalModel, sub = '')
This decision tree tells us that:
As we can see, petal length was the only variable that was needed to make these predictions.
Use the created dt_model to run a prediction on the test data.
prediction_dt <- predict(dt_model, test_data)
Check the proportion of the predictions which were accurate.
table(prediction_dt, test_data$Species) %>% # Create prediction table.
prop.table() %>% # Convert table values into proportions instead of counts.
round(2) # Round numbers to 2 significant values.
##
## prediction_dt setosa versicolor virginica
## setosa 0.33 0.00 0.00
## versicolor 0.00 0.31 0.00
## virginica 0.00 0.02 0.33
As can be seen, 98% of the species classifications were predicted correctly.
Quick facts:
Figure 2: Visualization of random forest model.
Image source: Link
fitControl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
Create a training model using train() from the CARET package.
# Create model
rf_model <- train(
Species ~ ., # Set Y variable followed by "~." to include all variables in formula.
method = 'rf', # Set method as random forest.
trControl = fitControl, # Set cross validation settings
data = train_data) # Set data as train_data.
Use the varImp() function to grab the importance of each variable in our random forest model and then plot them.
# Create object of importance of our variables
rf_importance <- varImp(rf_model)
# Create box plot of importance of variables
ggplot(data = rf_importance, mapping = aes(x = rf_importance[,1])) + # Data & mapping
geom_boxplot() + # Create box plot
labs(title = "Variable importance: Random forest model") + # Title
theme_light() # Theme
This plot tells us that petal length and width are the most important variable for prediction in our model.
Now, lets check the predicted accuracy of the random forest model by running it on resamples of our train data. Later we will test the accuracy of the model by running a prediction on our test data, which is data our model has not seen before.
confusionMatrix(rf_model)
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction setosa versicolor virginica
## setosa 33.3 0.0 0.0
## versicolor 0.0 29.5 2.9
## virginica 0.0 3.8 30.5
##
## Accuracy (average) : 0.9333
The output tells us that the predicted accuracy of our model is 94.29%. In addition, it also gives us a percentage matrix of species predictions against answers.
We will now use our created random forest model in order to predict species on our test data (i.e., the data set our ‘machine’ has not seen before).
Use the created rf_model to run a prediction on the test data.
prediction_rf <- predict(rf_model, test_data)
Check the accuracy of our random forest model on our test data.
table(prediction_rf, test_data$Species) %>% # Create prediction table.
prop.table() %>% # Convert table values into proportions instead of counts.
round(2) # Round numbers to 2 significant values.
##
## prediction_rf setosa versicolor virginica
## setosa 0.33 0.00 0.00
## versicolor 0.00 0.33 0.00
## virginica 0.00 0.00 0.33
100% of the species classifications were predicted correctly! It is important to note that this type of accuracy is rarely seen with real world data which is usually much more complex than our iris data set.
Figure 3: Bayes’ theorem.
Image source: Link
Model the Naive Bayes model with a 10 fold cross validation.
fitControl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
Create a predictor model with the train() function from the CARET package. Specify method = 'nb' to run a Naive Bayes model.
# Create model
nb_model <- train(Species ~ ., # Set y variable followed by '~'. The period indicates that we want to use all our variables for prediction.
data = train_data,
method = 'nb', # Specify Naive Bayes model
trControl = fitControl) # Use cross validation
Check the predicted accuracy of our model by running it on resamples of our train data. Later we will test the accuracy of the model by running a prediction on our test data.
confusionMatrix(nb_model)
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction setosa versicolor virginica
## setosa 33.3 0.0 0.0
## versicolor 0.0 29.5 2.9
## virginica 0.0 3.8 30.5
##
## Accuracy (average) : 0.9333
The results here tell us that our average accuracy is 95.24% when testing our data on resamples of our training data. We can also see what was predicted correctly/incorrectly.
Use the varImp() function to grab the importance of each variable in our random forest model and then plot them.
# Create object of importance of our variables
nb_importance <- varImp(nb_model)
# Create box plot of importance of variables
ggplot(data = nb_importance, mapping = aes(x = nb_importance[,1])) + # Data & mapping
geom_boxplot() + # Create box plot
labs(title = "Variable importance: Naive Bayes model") + # Title
theme_light() # Theme
This table gives us a very informative overview of the importance of each variable in predicting each species. We can see that petal width and length are the two most important variables for predicting each species.
Use the created nb_model to run a prediction on the test data.
prediction_nb <- predict(nb_model, test_data)
Check what proportion of the predictions which were accurate.
table(prediction_nb, test_data$Species) %>% # Create prediction table.
prop.table() %>% # Convert table values into proportions instead of counts.
round(2) # Round numbers to 2 significant values.
##
## prediction_nb setosa versicolor virginica
## setosa 0.33 0.00 0.00
## versicolor 0.00 0.31 0.00
## virginica 0.00 0.02 0.33
Here we can see that 98% of the species classifications were predicted correctly.
Quick facts:
Figure 4: Support vector machine hyperplane representation.
Image source: Link
Model the SVM model with a 10 fold cross validation.
fitControl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
Create a predictor model with the train() function from the CARET package. Specify method = 'svmLinear' to run a SVM model.
# Create model
svm_model <- train(Species ~ ., # Set Y variable followed by '~'. The period indicates to include all variables for prediction.
data = train_data, # Data
method = 'svmLinear', # Specify SVM model
trControl = fitControl) # Use cross validation
Check the predicted accuracy of our naive Bayes model by running it on resamples of our train data. Later we will test the accuracy of the model by running a prediction on our test data.
confusionMatrix(svm_model)
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction setosa versicolor virginica
## setosa 33.3 0.0 0.0
## versicolor 0.0 28.6 1.0
## virginica 0.0 4.8 32.4
##
## Accuracy (average) : 0.9429
The results here tell us that our average accuracy is 96.19% when testing our data on resamples of our training data. We can also see what was predicted correctly/incorrectly.
Use the varImp() function to grab the importance of each variable in our random forest model and then plot them.
# Create object of importance of our variables
svm_importance <- varImp(svm_model)
# Create box plot
ggplot(data = svm_importance, mapping = aes(x = svm_importance[,1])) + # Data & mapping
geom_boxplot() + # Create box plot
labs(title = "Variable importance: Support vector machine model") + # Title
theme_light() # Theme
This table gives us a very informative overview of the importance of each variable in predicting each species. We can see that petal length and petal width are the two most important variables for predicting each species.
Use the created svm_model to run a prediction on the test data.
prediction_svm <- predict(svm_model, test_data)
Check the proportion of the predictions which were accurate.
table(prediction_svm, test_data$Species) %>% # Create prediction table.
prop.table() %>% # Convert table values into proportions instead of counts.
round(2) # Round numbers to 2 significant values.
##
## prediction_svm setosa versicolor virginica
## setosa 0.33 0.00 0.00
## versicolor 0.00 0.33 0.00
## virginica 0.00 0.00 0.33
100% of the species classifications were predicted correctly. Nice!
Table of results:
| Machine learning model | Predicted Accuracy | Tested accuracy |
|---|---|---|
| Decision tree | 90.48% | 98% |
| Random Forest | 93.33% | 100% |
| Naive Bayes | 93.33% | 98% |
| Support Vector Machine | 94.29% | 100% |
For our data it looks like the random forest and support vector machine models performed the best.